ABSTRACT

The acoustic properties of a number of different speech sounds as they appear in several phonetic contexts are described. This report supplements an earlier report on the same topic and presents data for stop and nasal consonants in prestressed position, for the timing of vowels, and for acoustic events following stressed vowels. The aims of this survey are to provide an indication of the kinds of acoustic attributes that should be extracted from the speech signal in a potential scheme for machine recognition of speech. Also included is a discussion of the roles that must be played by acoustic data and by [illegible] constraints in schemes for automatic speech recognition.
of unaspirated labial stop consonants (heavy lines) and about 20 msec later during vowel transition (light lines). Spectra are obtained from 19-channel filter bank described in SK I; curves are identified by sample numbers representing 10-msec intervals.

2. Spectra sampled within about 10 msec of release of aspirated labial stop consonants (heavy lines) and about 20 msec after onset of voicing of following vowel (light lines). [See legend of Fig. 1.]

3. Spectra sampled within about 20 msec of release of unaspirated postdental stop consonants (heavy lines) and about 20 msec later during vowel transition (light lines). The initial spectra represent maximum levels in each filter over the frication interval, as indicated by the sample numbers. [See legend of Fig. 1.]
1.  INTRODUCTION

An earlier report* presented a survey of the acoustic properties of a number of different speech sounds as they appear in several phonetic contexts. That survey (hereafter referred to as SK I) was prepared primarily for the use of investigators who are interested in developing procedures for machine recognition of speech, since any such procedure must include a component that extracts certain acoustic properties or attributes from the speech signal. The material in SK I is intended to provide an indication of the kinds of acoustic attributes that should be extracted in a recognition scheme.

The data presented in SK I are far from complete and include, primarily, an analysis of stressed vowels and of consonants in prestressed position. The purpose of this supplementary report is to discuss the acoustic properties of speech sounds in other phonetic environments, particularly when the sounds occur after a stressed vowel.

The point of view in this study is that a given "speech sound" is characterized by several features or properties which can be used to categorize all speech sounds into natural classes depending on their manner and place of articulation. Frequently, the invariant attribute or property that characterizes a natural class or a feature is an articulatory position, posture, or maneuver. The specific acoustic attributes associated with a segment that possesses a given feature may depend to some extent on the other features of that segment and on the features of adjacent segments. For example, the acoustic attribute that characterizes a coronal consonant (i.e., a consonant produced with the tongue tip) may depend on whether the consonant is a stop, a fricative, or a nasal, and may also depend, in some cases, on the features of the vowel or other segment that follows the consonant. An acoustic description of a feature must, therefore, include data for the feature as it occurs in various contexts.

This supplementary report also includes some remarks (Sec. 4) concerning the problem of automatic speech recognition. These remarks suggest some reasons why automatic recognition of speech can be expected to have only limited success and point out the kinds of "knowledge" a speech recognizer must possess in order to interpret properly the acoustic events and identify the speech units.

The data presented in SK I, as well as in this supplementary report, were based primarily on analysis of a series of utterances (including nonsense syllables) and isolated words produced by three speakers. All of these utterances were processed by a bank of 19 band-pass filters whose rectified and smoothed outputs were sampled, quantized and printed out. Spectrograms of the recorded material were also produced. The details of the analysis procedures are described in SK I.
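As a rough illustration of the processing chain just described, the following Python sketch rectifies, smooths, and samples the output of a single band-pass channel. It is not the analysis system of SK I: the moving-average smoother, the 20-msec smoothing time, and the function name are assumptions made only for illustration, and the band-pass filtering and quantizing stages are omitted. Only the 10-msec sampling interval is taken from the report.

```python
import numpy as np

def rectify_smooth_sample(band_signal, sample_rate_hz, smooth_ms=20.0, frame_ms=10.0):
    """Full-wave rectify one band-pass output, smooth it, and sample it every frame_ms."""
    rectified = np.abs(np.asarray(band_signal, dtype=float))
    # Assumption: a moving average stands in for the (unspecified) smoothing filter.
    window = max(1, int(round(sample_rate_hz * smooth_ms / 1000.0)))
    kernel = np.ones(window) / window
    smoothed = np.convolve(rectified, kernel, mode="same")
    # One output sample per 10-msec interval, as in the printed filter-bank data.
    step = max(1, int(round(sample_rate_hz * frame_ms / 1000.0)))
    return smoothed[::step]

# Hypothetical input: one second of a 500-Hz tone standing in for a channel output.
sample_rate_hz = 8000
t = np.arange(sample_rate_hz) / sample_rate_hz
levels = rectify_smooth_sample(np.sin(2 * np.pi * 500 * t), sample_rate_hz)
print(len(levels))  # number of 10-msec samples in one second
```

Repeating this for each of the 19 channels would give one 19-point spectrum per 10-msec sample number, which is the form in which the spectra in the Figures are identified.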


Report  No.  1871 


Bolt  Beranek  and  Newman  Inc. 


2.  FURTHER  DATA  ON  CONSONANTS  IN  PRESTRESSED  POSITION 
2.1  Stop  Consonants 

Report SK I presented some samples of data that indicated the acoustic attributes that (1) identify prestressed stop consonants as a class, (2) distinguish between voiced and voiceless stop consonants, and (3) identify place of articulation of a stop consonant as a labial /b, p/, a dental /d, t/, or a velar /g, k/. This Section gives further data (particularly with regard to place of articulation) on stop-consonant properties. Furthermore, attributes of the affricates /č/ and /ǰ/ are discussed, and additional detail regarding the characteristics of stop consonants in initial consonant clusters is presented. It should be noted, incidentally, that stop consonants participate in most of the allowed initial consonant clusters in English.

The average duration of the stop gap for an initial stop consonant (preceded by an unstressed schwa) is in the range 110-130 msec (see SK I, Table VII, p. 42).* These durations apply both to initial single consonants and to stops in initial position in clusters (as in /tr/, /bl/, etc.). For individual utterances, these prestressed stop-gap durations may be as short as 60 msec or as long as 150 msec.

The aspiration interval following release of an initial consonant always identifies a voiceless stop as opposed to a voiced stop. As noted in SK I, this aspiration interval is in the range 50-100 msec for initial, single voiceless stops. The aspiration duration is usually at the upper end of this range (and sometimes may be greater) when the voiceless stop is the initial element in a consonant cluster (such as /kr/, /tw/). Generally, when a stop consonant follows an /s/ in an initial cluster, [illegible] following the release of the stop. There may be a brief [illegible] interval having a duration less than 10 msec [illegible] and 20-30 msec for the /t/ and /k/. These attributes of [illegible] consonants in consonant clusters [illegible] p. 67.)

*As noted elsewhere in SK I, this duration can become much shorter for stop consonants in other phonetic environments.
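The duration figures quoted in this Section suggest a simple timing rule for initial, single stops, sketched below in Python. The stop-gap limits (60-150 msec) and the lower aspiration limit (50 msec) come from the text; the 30-msec upper bound for a voiced stop is an assumption chosen to be comparable to the longest unaspirated burst reported here (about 30 msec for /g/), and the function name is hypothetical. The rule is not intended for stops following /s/ or for other positions in the utterance.

```python
def classify_prestressed_stop(gap_ms, aspiration_ms):
    """Label an initial, single prestressed stop from two duration measurements."""
    # Stop-gap durations for individual utterances range from 60 to 150 msec.
    if not 60 <= gap_ms <= 150:
        return "not a typical prestressed stop gap"
    # Aspiration of 50-100 msec (or more) marks a voiceless stop.
    if aspiration_ms >= 50:
        return "voiceless"
    # Assumption: intervals up to about 30 msec are treated as burst only.
    if aspiration_ms <= 30:
        return "voiced"
    return "ambiguous"

print(classify_prestressed_stop(120, 70))
print(classify_prestressed_stop(120, 10))
```

Measurements falling between the two limits are left undecided, since the text gives no data for that region.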
2.1.1  Labial stop consonants

Following release of the consonant /b/ there is only weak (or nonexistent) frication noise, and the transient interval is very brief (less than 10 msec). Several examples of spectra sampled about 10 msec after release [illegible] environments are shown in Fig. 1. (Since spectra were sampled only
[illegible] periods). Figure 3 also displays the spectra sampled [illegible] after the burst, during the transition into the vowel,
FIG. 4  Spectra sampled within about 10 msec of release of aspirated postdental stop consonants (heavy lines) and about 20 msec after onset of voicing of following vowel (light lines). [See legend of Fig. 1.]




2.1.3  Velar  stop  consonants 

The burst of frication noise following release of the consonant /g/ usually has a duration of about 30 msec. This longer interval of frication noise provides one cue for distinguishing /g/ from /b/, since the frication noise for /b/ is much briefer.*

Spectra for the /g/ burst, displayed in Fig. 5, were obtained by taking the maximum value for each sampled filter output during this 30-msec interval. As before, the spectra sampled 20-30 msec after the burst are shown, as are spectra corresponding to the unaspirated stop consonant in the syllable /sk /.
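The maximum-over-interval procedure used for the burst spectra can be written compactly. The Python sketch below assumes the filter-bank output is held as an array with one row per 10-msec sample and one column per filter, so that a 30-msec burst spans three rows; the array layout and the function name are assumptions for illustration.

```python
import numpy as np

def burst_spectrum(frames, first_frame, n_frames=3):
    """Per-filter maximum over the burst interval (3 frames = 30 msec at 10 msec per frame)."""
    frames = np.asarray(frames, dtype=float)
    return frames[first_frame:first_frame + n_frames].max(axis=0)

# Hypothetical data: 5 samples of a 19-channel filter bank.
frames = np.zeros((5, 19))
frames[1, 3] = 7.0    # peak in filter 4 during the first burst frame
frames[2, 10] = 4.0   # peak in filter 11 during the second burst frame
frames[4, 0] = 9.0    # outside the burst interval, so it is ignored
print(burst_spectrum(frames, first_frame=1))
```

Taking the maximum rather than a single sample keeps a brief spectral peak from being missed when it falls between sampling instants.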

In order to interpret these data, it is necessary to make a distinction between /g/ preceding a back vowel and /g/ preceding a front vowel. In the environment of a front vowel (/i/ in this case), the /g/ burst has major energy peaks in the high-frequency region (at about 2400 Hz and 3300 Hz in this example). These peaks are comparable in amplitude to the vowel spectrum in this frequency region (corresponding to F2 and F3 for a front vowel). When /g/ precedes a back vowel, the major spectral peak in the burst is at a lower frequency. This peak is in the vicinity of the second formant of the vowel and is again comparable in amplitude to the vowel spectrum amplitude in that frequency range.


*The existence of a burst of frication noise for /g/ is often not easy to see from the display of the outputs from the 19-channel filter bank. Frication noise cannot be distinguished from voiced excitation, since the averaging times of the smoothing filters are too great (see SK I, Fig. 2, p. 8). However, the frication interval can be seen easily on the spectrogram (which is produced with a much shorter averaging time, of the order of 3 msec).
Same as Fig. 3 except for unaspirated velar stop consonants.
Similar remarks may be made with regard to the spectra of /k/ shown in Fig. 6. Here, the spectra at the onset were sampled about 10 msec after the release of the stop consonant. The spectrum of the /k/ burst has a high-frequency peak (about 2600 Hz) when it precedes a front vowel and a peak at lower frequencies when it precedes a back vowel (or when it precedes /r/, /l/, or /w/).


2.1.4  Affricate  consonants 

There are two affricate consonants in English, voiceless /č/ and voiced /ǰ/, both of which can occur in either initial or final position in a syllable. The duration of the stop gap preceding the release of these consonants in prestressed position is similar to that for other stop consonants (in the range 90-120 msec for the utterances examined in this study). There is a long frication interval following the release; the average duration of frication for /č/ is about 100 msec and for /ǰ/ about 70 msec for the three speakers used in this study.


The spectrum of the /č/ during the frication interval is very similar to that of the /š/, and the /ǰ/ and /ž/ also have comparable spectra. Figure 7 shows the spectra of /č/ preceding each of three vowels and of /ǰ/ preceding the vowel /a/, all produced by one speaker. The spectrum peaks at about 2500 Hz and 3300 Hz, corresponding to F3 and F4, are evident in all of these spectra. The effect of voicing can be seen at the low-frequency end of the /ǰ/ spectrum.
FIG. 7  Spectra of three examples of /č/ and one example of /ǰ/ in the vowel environments shown, sampled during the frication intervals. Speaker KS. [Horizontal axis: frequency (kHz).]
2.2  Nasal Consonants


It was observed in SK I that nasal consonants as a class are characterized by an interval [illegible] in which the spectrum is relatively fixed, followed by a rapid change in spectrum as the consonant is released. The spectrum within this interval has a major peak in the range 200-300 cps (filters 1 and 2 for the analyzing system used in this study), and the amplitude at high frequencies is relatively low. Nasal consonants are always characterized by a minimum in spectrum amplitude around [illegible] Hz (filter [illegible] in the analyzing system used here).




The two nasal consonants that can occur in prestressed position in English can be distinguished from each other on the basis of the rapid changes occurring in the signal at the instant of consonantal release. The nasal murmur preceding release does not show consistently different characteristics for /m/ as opposed to /n/. (Some indication of the kinds of acoustic attributes that separate /m/ from /n/ has been presented in SK I [illegible].)

More detailed data for /m/ and /n/ are shown in Figs. 8 and 9, respectively. Each portion of these Figures shows a pair of spectra obtained from the 19-channel filter bank. One of the spectra (the one drawn with heavy lines) is sampled immediately before the consonantal release, and the other is sampled about 20 msec later. [The spectra sampled during rapid changes in the signal are, of course, very much dependent on the characteristics of the analyzing system, particularly the smoothing filters (see SK I, Fig. 2, p. 8)].


When the nasal consonant precedes the vowel /o/ (or, more generally, when it precedes a back vowel), there is a rapid and large jump in spectral energy in the vicinity of 1.7 kHz following release of /n/
FIG. 8  Spectra sampled immediately preceding release of consonant /m/ (heavy lines) and about 20 msec after release into following vowel (light lines).


[illegible]. When [illegible] is a front vowel (/ε/ [illegible]), there is [illegible] in spectral energy over the high-frequency range [illegible] in Fig. 9 following the /n/ release but a smaller [illegible] following the /m/ release. [illegible] indicate that the [illegible] to distinguish /m/ from
are imposed by the preceding vowel. For example, /ts/ is not possible in final position except in a two-morpheme situation; [illegible]p/ is possible but [illegible]p/ is not permissible; and for a cluster in which the first element is a nasal, both segments [illegible] the same place of articulation.

A second class of segment sequences that can follow a stressed vowel is a consonant (or consonant cluster) followed by an unstressed vowel, possibly followed in turn by a final consonant. The unstressed syllable can terminate in one or two [illegible] consonants [illegible] or in a sonorant consonant (/[illegible], r, l/). In many situations, the unstressed vowel [illegible] followed by a sonorant consonant simply reduces to a syllabic [illegible]. Examples of words of this type [illegible]. [illegible] could be represented as /[illegible]w/ and /[illegible]y/, respectively [illegible] the latter case. [illegible] the termination reduces [illegible].

[illegible] syllable can be followed by a syllable with secondary [illegible] in the words [illegible].

The acoustic events associated with the phonetic segments that follow a stressed vowel can be [illegible] in terms of (1) the effect [illegible] the stressed vowel, particularly [illegible], (2) [illegible] final consonant clusters, (3) data on poststressed intervocalic consonants, and (4) data on final unstressed syllables.* As

*Some discussion of [illegible] has been given previously [illegible]
an initial voiceless stop consonant differs from a voiced stop on the basis of the duration of aspiration preceding the onset of voicing; or, the presence of an initial stop consonant requires a silent interval whose duration exceeds a certain value (probably around 50 msec); or, as noted below, the voicing feature of a poststressed stop or fricative consonant is determined by timing of events in the preceding vowel. Some knowledge of the timing of speech events also helps to establish where certain acoustic characteristics in the signal are likely to occur and hence indicates where specific acoustic measurements on the signal are to be made.


In the kinds of utterances examined in this study, the stressed vowel is either the final vowel or is followed by one other vowel that does not have primary stress. Consequently, comments on timing are restricted to utterances of this type.

As indicated in SK I, pure, nondiphthongized stressed vowels in English can be either long (o, d, æ) or short (i, e, a, v). It might be argued that a long vowel is, in essence, a vowel-vowel combination, although the length of a long vowel is not quite double that of a short vowel. In the environment b-b, the durations of long vowels in single-syllable utterances are, on the average, 320 msec, whereas, on the basis of the data examined in this supplementary study, the durations of short vowels are 180 msec. Other stressed vowels, like /i, e, o, u/, are always followed by sonorant consonants to yield /iy, ey, ow, uw/; sometimes /a/ and /o/ are followed by sonorants to give the diphthongs /ay/, /aw/, or /oy/. These diphthongs and diphthongized vowels are about equal in length to long vowels in the environment b-b. The vowel /t / can probably also be classified as a long vowel; it is, in some sense, a distorted or degenerate version of a short vowel (/i/ or /a/) followed by the consonant /r/.


The duration of a stressed vowel or diphthong that is the nucleus of the final syllable in an utterance is influenced by the voicing feature of the final consonant. If the consonant is voiced, the vowel is lengthened; if it is voiceless, the vowel is shortened. The vowel is also lengthened if there is no consonant in final position. Average durations of vowels following the voiced and voiceless obstruents and nasals, as reported by House,* are shown in Table I.


TABLE I.  Average vowel durations for symmetrical consonant-vowel-consonant syllables in English. Data for each consonant represent averages over 36 utterances (12 vowels, 3 speakers). [From A.S. House, "On Vowel Duration in English," J. Acoust. Soc. Amer. 33, No. 9, 1174-1178 (1961).]

Consonantal Environment    Vowel Duration (msec)
b                          270
d                          310
p                          150
t                          150
m                          240
n                          260
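The size of the voicing effect in Table I is easiest to see as a ratio of durations. The short Python sketch below simply restates the tabulated values and divides; the dictionary and function names are chosen here for illustration.

```python
# Average vowel durations (msec) from Table I, keyed by consonantal environment.
table_i_ms = {"b": 270, "d": 310, "p": 150, "t": 150, "m": 240, "n": 260}

def voicing_ratio(voiced, voiceless, table=table_i_ms):
    """Ratio of vowel duration in a voiced environment to that in its voiceless cognate."""
    return table[voiced] / table[voiceless]

print(voicing_ratio("b", "p"))
print(round(voicing_ratio("d", "t"), 2))
```

For these averages the vowel in a voiced stop environment is roughly twice as long as in the corresponding voiceless environment, with the nasals falling in between.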

*See  legend  for  Table  I  for  complete  reference. 


When a stressed vowel is in the final syllable and is followed by a consonant cluster (consisting of a sonorant consonant followed by an obstruent consonant, as in bond, fault, etc.), then the voicing feature of the final consonant influences the duration of the vowel-sonorant combination in much the same way that single vowel durations are influenced. The altered duration occurs both on the vowel segment and the sonorant portion. When the final consonant cluster is a sequence of two obstruent consonants, as in lots and sods, then both of these segments always have the same voicing feature, and this voicing feature again influences vowel duration in the same way. These effects on vowel duration for the few single-syllable words examined in this study are given in Table II. For the syllabic nuclei with sonorant consonants, the vowel-plus-sonorant combinations are somewhat greater than the durations of the single-vowel nuclei, except when the sonorant is /r/.

When a stressed vowel is followed by another syllable that does not have primary stress, the presence of this additional syllable causes a reduction in length of the stressed vowel. The vowel durations tabulated in SK I show marked differences for stressed vowels in bisyllabic words having stress on the first syllable. In fact, it is generally observed that any vowel in the final syllable of an utterance is longer than the same vowel in another position.

If rough estimates of average durations are made from the data in Tables IV and V of SK I (see SK I, pp. [illegible] and 20), one finds that the average duration of short vowels in bisyllabic words is about 110 msec and that of long vowels (or vowel-plus-sonorant combinations) is about 180 msec. For monosyllabic words, the corresponding averages are about 180 and 320 msec as noted above. Thus, approximately 70 msec of duration is added to stressed short vowels
TABLE II.  Durations of vowel and sonorant segments in monosyllabic words terminating in voiced and voiceless consonants (average data for three speakers).

Word     Vowel Duration (msec)   Sonorant Duration (msec)   Vowel and Sonorant (msec)
gaunt    240                     50                         290
fault    200                     60                         260
heart    100                     80                         180
bond     310                     140                        450
bald     310                     110                        420
hard     200                     130                        330
lots     150                     -                          150
sods     300                     -                          300

when  they  appear  in  syllable-final  position;  and  about  140  msec  is 
added  to  long  vowels  in  the  same  situation. 
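The two increments quoted here follow directly from the average durations given in the text. The Python sketch below restates that subtraction; the dictionary layout and function name are illustrative only.

```python
# Approximate average stressed-vowel durations (msec) quoted in the text.
durations_ms = {
    "short": {"bisyllabic": 110, "monosyllabic": 180},
    "long": {"bisyllabic": 180, "monosyllabic": 320},
}

def final_lengthening(kind, table=durations_ms):
    """Extra duration when the stressed vowel is in the final (only) syllable."""
    return table[kind]["monosyllabic"] - table[kind]["bisyllabic"]

print(final_lengthening("short"))  # 180 - 110
print(final_lengthening("long"))   # 320 - 180
```

The lengthening of long vowels is twice that of short vowels in absolute terms, while the proportional increase is similar for the two classes (about 64 percent and 78 percent, respectively).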

In the two-syllable utterances with stress on the first syllable, the voicing characteristics of an intervocalic consonant following the stressed vowel have only a weak influence on the duration of the stressed vowel. This influence appears to be stronger when the stressed syllable forms a single morpheme (e.g., seated vs seeded), and the effect is often much smaller for one-morpheme words (e.g., rabid vs rapid). Presumably in the latter case, the voicing feature of the consonant is signaled by other acoustic events such as aspiration or vocal-cord vibration during the stop gap. Thus the vowel durations in seated and beaded (averaged over the three speakers) are 110 and 160 msec, respectively, while the vowel durations in rapid and rabid are 170 and 200 msec, respectively.

These data on the timing and duration of events within stressed vowels may be summarized as follows. The duration of a stressed vowel, a diphthong, or a vowel-plus-sonorant sequence (excluding nasals) may range from about 80 msec to more than 400 msec (in "normal" speech production), depending on the features of the vowel and on phonetic events following the vowel. Measurements on the vowel spectrum in the region 50-100 msec following release of the initial consonant or consonant cluster can usually serve to make an identification of the place of articulation of the vowel, i.e., the status of the features high, low, back, and rounded. If the vowel interval extends beyond this time, additional measurements in the following time interval can be made to determine whether the vowel is long or short and whether it is a diphthong or is followed by a sonorant consonant. If it is determined that the vowel is followed by an obstruent consonant, then duration measurements on the vowel may be needed to determine whether the consonant is voiced or voiceless.
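The sequence of measurements summarized in this paragraph can be laid out as a schedule of time windows. The Python sketch below is one possible reading of that summary, not a procedure given in the report: the function name, the window labels, and the choice to return windows as (label, start, end) tuples in msec after consonant release are all assumptions.

```python
def plan_vowel_measurements(vowel_duration_ms, obstruent_follows):
    """List (label, start_ms, end_ms) windows, timed from release of the initial consonant."""
    plan = []
    # Vowel features are read from the spectrum 50-100 msec after release.
    quality_end = min(100, vowel_duration_ms)
    if quality_end > 50:
        plan.append(("vowel features (high, low, back, rounded)", 50, quality_end))
    # A longer vowel interval allows the later measurements.
    if vowel_duration_ms > 100:
        plan.append(("long vs. short; diphthong or following sonorant", 100, vowel_duration_ms))
    # Total duration serves as a cue to the voicing of a following obstruent.
    if obstruent_follows:
        plan.append(("total duration as cue to obstruent voicing", 0, vowel_duration_ms))
    return plan

for window in plan_vowel_measurements(320, obstruent_follows=True):
    print(window)
```

For a vowel near the 80-msec lower limit only the first window is available, which is consistent with the remark that the later measurements apply only if the vowel interval extends beyond this time.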

Unstressed  vowels  are,  of  course,  generally  shorter  than  stressed 
vowels,  but  their  durations  undergo  influences  that  are  similar 
to  those  for  stressed  vowels.  Thus,  an  unstressed  vowel  that  is 
the  nucleus  of  a  final  syllable  (as  in  famous)  is  usually  longer 
than  an  unstressed  vowel  in  an  earlier  syllable  (as  in  about). 
Likewise,  the  voicing  characteristics  of  a  final  consonant 
following an unstressed vowel influence the vowel duration. For example, the average unstressed vowel duration in famous is [illegible] msec (for the three talkers in this study), whereas in edges it is 160 msec; but these durations are quite variable from one talker to another.


As noted in SK I, the duration of the constricted interval for a single consonant is always less when the consonant follows a stressed vowel and precedes an unstressed vowel than for a consonant in prestressed position. This duration may be as short as 15-20 msec when the intervocalic consonant is a dental stop or nasal and may be as long as 150 msec for a voiceless fricative consonant in this position.

3.3  Single  Poststressed  Consonants  in  Final  Position 

The acoustic attributes that characterize a consonant in final position are often markedly different from the attributes of the same consonant in prestressed position, particularly with regard to the way voicing and manner of articulation are signaled. The following subsections present brief comments on the various classes of consonants that appear in final position.


3.3.1  Final  fricative  consonants 

The spectra of voiceless fricative consonants are very much the same whether the consonants appear in prestressed position or in poststressed final position. Thus, the spectra shown in Fig. 12 of SK I (see SK I, p. 45) indicate the main features of voiceless fricative consonant spectra for any phonetic environment.

Spectra sampled in the fricative [illegible] a final voiced fricative consonant indicate that this segment is voiceless or only weakly voiced throughout most of its length. This lack of voicing (or weak voicing) can be seen in the spectrograms shown in Fig. 13 of SK I (see SK I, p. 47). Examples of spectra sampled in the middle of the consonantal interval for both initial and final voiced fricatives are compared here in Fig. 10.

All of these examples show appreciably less low-frequency energy in the spectra of the final consonants and indicate that there is little or no vocal-cord vibration in the consonantal intervals. There are also some differences in the spectra at high frequencies, suggesting that there may be some differences in tongue position and degree of constriction for the final consonants relative to the initial consonants.


The spectra in the constricted intervals of final voiced fricative consonants are, therefore, very similar to those for voiceless consonants. The difference between a final voiced and voiceless fricative is signaled largely by the time course of the vowel preceding the consonant. Not only is the vowel longer before a voiced fricative (as noted in Sec. 3.2), but the vowel configuration in the few tens of milliseconds prior to the onset of frication noise for the consonant is generally more constricted. This difference is illustrated in Fig. 11, which shows vowel spectra sampled about 70 msec prior to onset of the consonant in the syllables /sas/ and /zoz/. It is evident that the first-formant frequency is much lower in the case of the voiced consonantal environment, with the result that there is much less energy in the vowel
FIG. 10  Spectra of voiced fricative consonants sampled during constricted intervals preceding a stressed vowel (light lines) and in terminal position following the stressed vowel (heavy lines). Sample numbers are identified. Speaker KS.
lines). Speaker KS.
FIG. 12  Spectra sampled 10 msec prior to consonantal closure (light lines) and 70 msec prior to consonantal closure (heavy lines) in syllables /ob/ and /od/. Speaker T. S
like /bait/ [illegible] nasals are quite similar, with [illegible] or filter 2 and with a rapid drop in spectrum at filters 3 and 4. The /l/ spectrum has a broad low-frequency peak extending over filters 2-4, corresponding to the closely spaced first- and second-formant frequencies. For /r/ the first-formant peak is at filter 2 and there is another important peak in the vicinity of 1500 Hz, corresponding to the combination of F2 and F3. These major spectral features appear also to characterize these final consonants for the other two speakers.

From data of this kind, it is probable that simple measurements on
the spectra could be used to distinguish the nasals as a class,
and could separate /l/ and /r/. Reference to the vowel spectra in
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SK I (see SK I, Fig. 5, p. 22) suggests that some difficulties
might arise in separating /l/ from /w/ in final position using
these kinds of spectral data. However, other information relating
to the timing of the syllable could be used for these purposes.
For example, bow (as in "bow and arrow") and bowl have
quite different durations for the vowel and sonorant regions.
Furthermore, pairs like ball and bow (as in "bow of a ship") are
separable because the vowel qualities are different.

Some indication of the distinction between /m/, /n/, and /ŋ/ in
final position can be seen in the graphs of Fig. 14. This figure
shows spectra sampled about 20 msec prior to consonantal closure
and again 30 msec after consonantal closure. The distinguishing
attribute of the /n/ is the relatively sharp drop in the spectrum
at high frequencies, particularly in the vicinity of the second
or third formants. This drop occurs presumably because a zero is
inserted in the vocal-tract transfer function in this frequency
range at the instant when consonantal closure occurs. In the case
of /ŋ/, an energy peak remains in the spectrum in the vicinity of
the second formant of the vowel after consonantal closure occurs.
The detailed characteristics of these nasal consonants may depend
on the preceding vowel. In some cases, a vowel preceding a nasal
consonant is nasalized, and the influence of this nasalization
can be seen in the vowel spectrum. For the vowel /i/ in the
syllable in, in Fig. 14, the nasalization is manifested as a bump in
the spectrum in the vicinity of filter 5, at a point where a
formant would not be expected for this vowel.
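
The comparison of spectra sampled before and after consonantal
closure can be expressed as a simple measurement on filter-bank
levels. The following Python sketch is only an illustration: the
function names, the choice of filters 8 to 12 as the region near the
second and third formants, and all of the level values are
assumptions, not data taken from Fig. 14.

```python
def mean_level(spectrum, filters):
    """Mean level (dB) over 1-based filter numbers of a filter-bank spectrum."""
    return sum(spectrum[f - 1] for f in filters) / len(filters)


def band_drop(pre, post, filters):
    """Fall in mean level (dB) from the pre-closure to the post-closure sample."""
    return mean_level(pre, filters) - mean_level(post, filters)


# Toy 19-channel levels in dB (illustrative values, not measurements).
PRE = [40, 45, 42, 38, 35, 34, 36, 38, 36, 33,
       30, 28, 26, 24, 22, 20, 18, 16, 14]
# Post-closure sample with a sharp high-frequency drop (the pattern
# the text describes for /n/).
POST_SHARP_DROP = PRE[:7] + [15, 12, 10, 10, 9] + PRE[12:]
# Post-closure sample in which a peak remains near the second formant
# (the pattern the text describes for the velar nasal).
POST_PEAK_KEPT = PRE[:7] + [30, 32, 29, 24, 20] + PRE[12:]

BAND = range(8, 13)  # hypothetical filters near the second and third formants

if __name__ == "__main__":
    print(band_drop(PRE, POST_SHARP_DROP, BAND))
    print(band_drop(PRE, POST_PEAK_KEPT, BAND))
```

A larger drop in the chosen band would count toward /n/ and a smaller
one toward the velar nasal; any threshold separating the two would
have to be set from real measurements.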


Report No. 1871

Bolt Beranek and Newman Inc.

4.  REMARKS  ON  THE  USE  OF  ACOUSTIC  DATA  IN  SCHEMES 
FOR  MACHINE  RECOGNITION  OF  SPEECH 

Approaches to automatic speech recognition have generally followed
two different paths:

(1)  Recognition is restricted to a closed set of utterances, each
     produced in isolation. A set of properties is extracted from
     the acoustic signal corresponding to each utterance, and a
     decision procedure operates on these properties to identify
     the utterance. It is necessary to include a learning phase
     in which distributions of the properties for each utterance
     (repeated many times) are determined. The decision procedures
     are based on observation of these distributions.

(2)  The second approach involves no adaptive procedure or learning
     phase but recognizes that an utterance is constructed
     from a rather limited set of linguistic units. The recognition
     routine attempts to identify these smaller units (such
     as phonemes or phonetic features) and, from the results of
     these identifications, to recognize the entire utterance.
     In this situation, a lexicon that specifies the inventory of
     utterances in terms of the smaller phonetic units or features
     may or may not be a part of the overall recognition procedure,
     but usually it is a necessary component.
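
As a rough illustration of the first approach, the following Python
sketch runs a learning phase over repeated utterances and then
applies a decision procedure to a new one. The Gaussian model with
independent properties, the function names, and the toy property
vectors are all assumptions made for this sketch; the form of the
distributions and of the decision procedure is not specified here.

```python
import math
from statistics import mean, pvariance


def learn(training):
    """Learning phase: estimate a mean and variance for each property.

    training maps an utterance label to a list of property vectors,
    one vector per repetition of that utterance.
    """
    model = {}
    for label, repetitions in training.items():
        columns = list(zip(*repetitions))  # one column per property
        mu = [mean(c) for c in columns]
        # Floor the variance so a property that never varied in
        # training does not cause a division by zero.
        var = [max(pvariance(c), 1e-6) for c in columns]
        model[label] = (mu, var)
    return model


def log_likelihood(x, mu, var):
    """Log-likelihood of x under independent Gaussian properties."""
    return sum(
        -0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mu, var)
    )


def decide(model, x):
    """Decision procedure: pick the utterance whose distributions fit x best."""
    return max(model, key=lambda label: log_likelihood(x, *model[label]))


# Toy property vectors (hypothetical values, two properties per utterance).
TRAINING = {
    "yes": [[1.0, 5.0], [1.2, 5.1], [0.9, 4.8]],
    "no": [[3.0, 2.0], [3.1, 2.2], [2.8, 1.9]],
}
MODEL = learn(TRAINING)

if __name__ == "__main__":
    print(decide(MODEL, [1.1, 5.0]))
    print(decide(MODEL, [3.0, 2.1]))
```

The sketch works only to the extent that the chosen properties stay
stable from one repetition to the next.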


One basic problem in the first approach is the selection of
properties sufficiently invariant from one repetition of an utterance
to another. This selection is a most difficult task for several
reasons. Perhaps the most important reason stems from the fact
that, for any utterance in a language, there are rather severe
constraints on the patterns of features that are allowed, i.e.,
on the phoneme classes and phoneme sequences that can occur in
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the  language.  Because  of  these  constraints,  it  is  not  necessary 
for the acoustic signal to carry precise information concerning
every feature. A native speaker of the language somehow knows the
rules (usually called phonological rules) governing the allowed
patterns, and when he listens to another speaker, he can "fill in"
information lacking in the signal by invoking these rules.
Consequently, a speaker has a great deal of choice in how precisely he
generates acoustic information concerning the various features.
Often the acoustic information that corresponds to a given feature
may be nonexistent or fragmentary, since the phonological rules
(together with a lexicon) can specify the feature or at least can
indicate that the feature is highly probable. It is quite
possible, however, that a feature having weak or nonexistent acoustic
correlates in one utterance may have important and essential
acoustic correlates in another utterance of the same word or phrase.

Thus, for example, a speaker producing a simple word like legal
has several possibilities open to him. The intervocalic /g/ may
be produced as a velar fricative rather than as a stop. (This
substitution would cause no confusion to an English listener who
knows that a velar fricative is not allowed in his language and,
therefore, must be interpreted as a stop.) The final unstressed
syllable may be produced either as a schwa followed by /l/ or
simply as a syllabic /l/. (No confusion arises here, since there
can be no vowel contrast in this unstressed position.)
Furthermore, the stressed vowel could be produced as /i/ or as /I/ or as
anything in between without causing confusion. Thus there is a
whole range of possibilities for generating the entire word.

A recognition scheme based on a learning or adaptive procedure
would have to be exposed to all of these (and other) possibilities
for the word during the learning phase, if it were to operate
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satisfactorily for several speakers or even for one speaker
unless the speaker were carefully trained. It might be argued, of
course, that as long as some of the properties of a given test
utterance match, or are similar to, the set of stored properties
for an item in the lexicon, then correct recognition might be
achieved. That is, in a recognition task involving a limited
vocabulary there might be sufficient redundancy that only certain
attributes of an input utterance need to provide a match with the
stored attributes. Except for situations involving a rather
limited inventory of carefully selected utterances, however, this
approach to speech recognition would have little chance of success.

The second approach to speech recognition (in which phonetic units
or features are identified) also represents a most difficult
problem, since it is necessary to store within the recognizer knowledge
of the phonological rules that are possessed by a native speaker
of the language. Also, and more important, the acoustic
representation of a feature may vary tremendously depending on the
environment of other features in which it occurs, as has been noted in
this report and in SK I. The potential advantages of working
toward a representation of an utterance in terms of features are
(1) at least some features for some environments have well defined
and reasonably stable acoustic correlates, and (2) a representation
based on features is a convenient framework in terms of which the
influence of environment on the acoustic correlates of a feature
can be stated and the rules governing the allowed patterns of
features in a language can be specified. Thus, in the example cited
above, it would be easy to indicate in these terms that a velar
fricative is not allowed (or alternatively, that recognition of
the fact that a consonant is a velar automatically requires that
it be a stop). It would also be a simple matter to require that
all syllables with no stress consist of a vocalic schwa segment
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that may or may not be followed by a consonant and to require that
syllabic sonorant consonants be represented in these terms.
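
The two constraints just mentioned can be written as rewriting rules
over a feature representation. The Python sketch below is a minimal
illustration, assuming a hypothetical encoding in which each segment
is a dictionary of features; the feature names and values are
invented for the example and are not a notation used in this report.

```python
def apply_constraints(segments):
    """Rewrite observed segments so they obey two English constraints.

    Each segment is a dict of features (hypothetical encoding).
    The input list and its dicts are left unmodified.
    """
    out = []
    for seg in segments:
        seg = dict(seg)  # work on a copy
        # A velar fricative is not allowed: a velar consonant must be a stop.
        if seg.get("place") == "velar" and seg.get("manner") == "fricative":
            seg["manner"] = "stop"
        # A syllabic sonorant is represented as an unstressed schwa
        # followed by the (non-syllabic) sonorant consonant.
        if seg.get("syllabic") and seg.get("manner") == "sonorant":
            out.append({"manner": "vowel", "quality": "schwa", "stress": False})
            seg["syllabic"] = False
        out.append(seg)
    return out


# One possible production of "legal": velar fricative and syllabic /l/.
OBSERVED = [
    {"manner": "sonorant", "place": "alveolar", "syllabic": False},
    {"manner": "vowel", "quality": "i", "stress": True},
    {"manner": "fricative", "place": "velar", "voiced": True},
    {"manner": "sonorant", "place": "alveolar", "syllabic": True},
]

if __name__ == "__main__":
    for segment in apply_constraints(OBSERVED):
        print(segment)
```

After rewriting, the variant productions of the word collapse onto a
single five-segment representation that can be compared with a
lexicon entry.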

The point of this discussion is that a native speaker of a language
has knowledge of the constraints on the possible patterns of
features and feature sequences that can occur in an utterance. If
the acoustic manifestation of a given feature is absent or
distorted, this feature can be filled in by the listener on the basis
of his knowledge of the constraints. The particular features that
are distorted or missing in a given utterance may vary from one
repetition of the utterance to the next. Some acoustic aspects of
an utterance must, of course, provide clear and unequivocal
information concerning the identity of a feature or features; the
acoustic correlates of other features may be sufficiently distorted that
they can only provide corroborative information when the context
already permits strong hypotheses to be made concerning the
identity of these features.
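
The filling in of missing information can be sketched as a search
over a lexicon. In the Python example below, segments that the
acoustic analysis could not identify are marked None, and the
lexicon supplies them whenever the clearly identified segments leave
only one candidate. The three-word lexicon, the phoneme symbols, and
the function name are hypothetical and serve only as an illustration.

```python
# Toy lexicon: word -> phoneme sequence (hypothetical entries).
LEXICON = {
    "legal": ("l", "i", "g", "ə", "l"),
    "regal": ("r", "i", "g", "ə", "l"),
    "lethal": ("l", "i", "θ", "ə", "l"),
}


def candidates(observed, lexicon):
    """Return the words consistent with a partially identified utterance.

    observed is a sequence of phoneme symbols in which None marks a
    segment whose acoustic correlates were absent or too distorted.
    """
    matches = []
    for word, phones in lexicon.items():
        if len(phones) != len(observed):
            continue
        # Every identified segment must agree; None matches anything.
        if all(o is None or o == p for o, p in zip(observed, phones)):
            matches.append(word)
    return matches


if __name__ == "__main__":
    # Medial consonant missing: two hypotheses remain.
    print(candidates(("l", "i", None, "ə", "l"), LEXICON))
    # Both vowels missing, consonants clear: one hypothesis remains.
    print(candidates(("l", None, "g", None, "l"), LEXICON))
```

When more than one candidate remains, weak acoustic evidence for the
missing segment can still serve to corroborate one hypothesis over
the others.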

In view of these remarks, the acoustic data presented in this
report and in SK I cannot be expected to encompass the properties of
phonetic segments in all possible phonetic environments. An
attempt is made to give examples of acoustic attributes in situations
where they are reasonably unambiguous. Some data are presented,
however, to indicate how these attributes may be modified in other
phonetic environments. Furthermore, these acoustic data represent
only a part of the knowledge that must be available to a machine
that is to recognize speech. Also, the machine must be equipped
with a set of rules specifying the constraints on patterns of
features that are allowed in English and with a strategy for taking
these rules into account.
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
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